187 research outputs found
Entity matching with transformer architectures - a step forward in data integration
Transformer architectures have proven to be very effective and provide state-of-the-art results in many natural language tasks. The attention-based architecture, in combination with pre-training on large amounts of text, led to the recent breakthrough and to a variety of slightly different implementations.
In this paper we analyze how well four of the most recent attention-based transformer architectures (BERT, XLNet, RoBERTa and DistilBERT) perform on the task of entity matching - a crucial part of data integration. Entity matching (EM) is the task of finding data instances that refer to the same real-world entity. It is a challenging task if the data instances consist of long textual data or if the data instances are "dirty" due to misplaced values.
To evaluate the capability of transformer architectures and transfer-learning on the task of EM, we empirically compare the four approaches on inherently difficult data sets. We show that transformer architectures outperform classical deep learning methods in EM by an average margin of 27.5%
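As an illustration of how EM can be posed as a text-pair task for such models, the following minimal sketch flattens two records into a single input string. The attribute names, the sample records, and the `[SEP]` separator convention are assumptions made for this example and are not taken from the paper.

```python
def serialize_record(record):
    """Flatten attribute/value pairs into one string, skipping empty values."""
    return " ".join(f"{key}: {value}" for key, value in record.items() if value)


def serialize_pair(left, right, sep="[SEP]"):
    """Join two serialized records so a pair classifier can read them together."""
    return f"{serialize_record(left)} {sep} {serialize_record(right)}"


# Hypothetical "dirty" records: values are missing or placed differently.
left = {"title": "iPhone 12 64GB", "brand": "Apple", "price": ""}
right = {"title": "Apple iPhone 12 (64 GB)", "brand": "", "price": "799"}

text = serialize_pair(left, right)
print(text)
```

A fine-tuned sequence-pair classifier would then label each such string as match or non-match; that step is omitted here.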
Data Grid tutorials with hands-on experience
Grid technologies are increasingly used in scientific as well as industrial environments, but documentation is often insufficient and correct usage is not well understood. Comprehensive training with hands-on experience helps people first to understand the technology and second to use it correctly and efficiently. We have organised and run several training sessions in different locations all over the world and report on our experience. The major factors of success are a solid base of theoretical lectures and, more importantly, a facility that allows for practical Grid exercises during and possibly after tutorial sessions.
Data Science für Lehre, Forschung und Praxis
Acquired under the Swiss National Licences (http://www.nationallizenzen.ch). Data science is on everyone's lips. Not only is it discussed at conferences on big data, cloud computing, or data warehousing: according to the McKinsey Global Institute, there will be a shortage of up to 190,000 data scientists in the USA alone in the coming years. In this chapter we therefore first shed light on the background of the term data science. We then present typical use cases and solution strategies, including ones from the big data field. Finally, using the Diploma of Advanced Studies in Data Science at ZHAW as an example, we show ways to become active oneself.
SODA: Generating SQL for Business Users
The purpose of data warehouses is to enable business analysts to make better
decisions. Over the years the technology has matured and data warehouses have
become extremely successful. As a consequence, more and more data has been
added to the data warehouses and their schemas have become increasingly
complex. These systems still work well for generating pre-canned
reports. However, with their current complexity, they tend to be a poor match
for non-tech-savvy business analysts who need answers to ad-hoc queries that
were not anticipated. This paper describes the design, implementation, and
experience of the SODA system (Search over DAta Warehouse). SODA bridges the
gap between the business needs of analysts and the technical complexity of
current data warehouses. SODA enables a Google-like search experience for data
warehouses by taking keyword queries of business users and automatically
generating executable SQL. The key idea is to use a graph pattern matching
algorithm that uses the metadata model of the data warehouse. Our results with
real data from a global player in the financial services industry show that
SODA produces queries with high precision and recall, and makes it much easier
for business users to interactively explore highly complex data warehouses.
Comment: VLDB201
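The idea of turning keywords into SQL with the help of a metadata model can be sketched roughly as follows. This is a toy illustration and not SODA's actual algorithm: the keyword index, the table names, and the foreign-key graph are all hypothetical, and a plain breadth-first search stands in for the graph pattern matching described in the abstract.

```python
from collections import deque

# Hypothetical metadata model: keyword -> (table, column), plus foreign-key edges.
KEYWORD_INDEX = {
    "customer": ("customers", "name"),
    "trade": ("trades", "amount"),
}
FOREIGN_KEYS = {
    ("customers", "accounts"): "customers.id = accounts.customer_id",
    ("accounts", "trades"): "accounts.id = trades.account_id",
}


def neighbors(table):
    """Yield (adjacent table, join condition) for every foreign key touching table."""
    for (a, b), cond in FOREIGN_KEYS.items():
        if a == table:
            yield b, cond
        elif b == table:
            yield a, cond


def join_path(start, goal):
    """Breadth-first search for the shortest chain of joins from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for nxt, cond in neighbors(table):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(nxt, cond)]))
    return None


def keywords_to_sql(keywords):
    """Map keywords to columns, then connect their tables through join paths."""
    hits = [KEYWORD_INDEX[k] for k in keywords if k in KEYWORD_INDEX]
    if not hits:
        raise ValueError("no keyword matched the metadata")
    first_table = hits[0][0]
    columns = ", ".join(f"{table}.{column}" for table, column in hits)
    sql = f"SELECT {columns} FROM {first_table}"
    joined = {first_table}
    for table, _ in hits[1:]:
        path = join_path(first_table, table)
        if path is None:
            raise ValueError(f"no join path to {table}")
        for step_table, cond in path:
            if step_table not in joined:
                sql += f" JOIN {step_table} ON {cond}"
                joined.add(step_table)
    return sql


sql = keywords_to_sql(["customer", "trade"])
print(sql)
```

The intermediate `accounts` table never appears in the keyword query; it is pulled in only because the metadata graph requires it to connect the two matched tables.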
Lessons learned from challenging data science case studies
In this chapter, we revisit the conclusions and lessons learned of the chapters presented in Part II of this book and analyze them systematically. The goal of the chapter is threefold: firstly, it serves as a directory to the individual chapters, allowing readers to identify which chapters to focus on when they are interested either in a certain stage of the knowledge discovery process or in a certain data science method or application area. Secondly, the chapter serves as a digested, systematic summary of data science lessons that are relevant for data science practitioners. And lastly, we reflect on the perceptions of a broader public towards the methods and tools that we covered in this book and dare to give an outlook towards the future developments that will be influenced by them
Toward automatic data curation for open data
In recent years large amounts of data have been made publicly available: literally thousands of open data sources exist, with genome data, temperature measurements, stock market prices, population and income statistics etc. However, accessing and combining data from different data sources is both non-trivial and very time consuming. These tasks typically take up to 80% of the time of data scientists. Automatic integration and curation of open data can facilitate this process
Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective
The current boom of learned query optimizers (LQO) can be explained not only
by the general continuous improvement of deep learning (DL) methods but also by
the straightforward formulation of a query optimization problem (QOP) as a
machine learning (ML) one. The idea is often to replace dynamic programming
approaches, widespread for solving QOP, with more powerful methods such as
reinforcement learning. However, such a rapid "game change" in the field of QOP
could not pass without consequences - other parts of the ML pipeline, except
for predictive model development, have large improvement potential. For
instance, different LQOs introduce their own restrictions on training data
generation from queries, use an arbitrary train/validation approach, and
evaluate on a voluntary split of benchmark queries.
In this paper, we attempt to standardize the ML pipeline for evaluating LQOs
by introducing a new end-to-end benchmarking framework. Additionally, we guide
the reader through each data science stage in the ML pipeline and provide novel
insights from the machine learning perspective, considering the specifics of
QOP. Finally, we perform a rigorous evaluation of existing LQOs, showing that
PostgreSQL outperforms these LQOs in almost all experiments depending on the
train/test splits
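To make the dependence on train/test splits concrete, the sketch below contrasts two ways of splitting a set of benchmark queries. The query identifiers, the notion of a "template", and both helper functions are assumptions for illustration; they are not the paper's framework.

```python
import random


def random_split(queries, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and hold out a fraction of queries for testing."""
    shuffled = sorted(queries)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def template_split(queries, template_of, test_templates):
    """Hold out whole templates so no test template is seen during training."""
    train, test = [], []
    for query in queries:
        (test if template_of(query) in test_templates else train).append(query)
    return train, test


# Hypothetical query ids: the digits name the template, the letter the variant.
queries = ["1a", "1b", "2a", "2b", "3a", "3b"]


def template_of(query):
    return query[:-1]


rand_train, rand_test = random_split(queries)
tmpl_train, tmpl_test = template_split(queries, template_of, {"3"})
print(tmpl_train, tmpl_test)
```

With a random split, variants of the same template can land on both sides, which is generally an easier setting than holding out an entire template.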
Database search vs. information retrieval : a novel method for studying natural language querying of semi-structured data
The traditional approach of querying a relational database is via a formal language, namely SQL. Recent developments in the design of natural language interfaces to databases show promising results for querying either with keywords or with full natural language queries and thus render relational databases more accessible to non-tech-savvy users. Such enhanced relational databases basically use a search paradigm which is commonly used in the field of information retrieval. However, the way systems are evaluated in the database and the information retrieval communities often differs due to a lack of common benchmarks. In this paper, we provide an adapted benchmark data set that is based on a test collection originally used to evaluate information retrieval systems. The data set contains 45 information needs developed on the Internet Movie Database (IMDb), including corresponding relevance assessments. By mapping this benchmark data set to a relational database schema, we enable a novel way of directly comparing database search techniques with information retrieval. To demonstrate the feasibility of our approach, we present an experimental evaluation that compares SODA, a keyword-enabled relational database system, against the Terrier information retrieval system and thus lays the foundation for a future discussion of evaluating database systems that support natural language interfaces.
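Given relevance assessments for an information need, any system's result list can be scored in the same way, whether it comes from a database search or an information retrieval engine. A minimal sketch of that scoring step follows; the document ids are invented for the example, and the metrics actually reported in the paper may differ.

```python
def precision_recall(retrieved, relevant):
    """Score one result list against the set of ids judged relevant."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical relevance assessments for a single information need.
relevant = {"m1", "m2", "m3", "m4"}

# Hypothetical result list returned by one of the compared systems.
retrieved = ["m1", "m2", "m9"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```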